Abstract. We consider a generic convex-concave saddle point problem
with separable structure, a form that covers a wide range of machine
learning applications. Under this problem structure, we follow the
framework of primal-dual updates for saddle point problems and
incorporate stochastic block coordinate descent with adaptive stepsizes into this
framework. We show theoretically that the proposed adaptive stepsize
potentially achieves a sharper linear convergence rate than
existing methods. Additionally, since we can select a "mini-batch" of block
coordinates to update, our method is also amenable to parallel
processing for large-scale data. We apply the proposed method to regularized
empirical risk minimization and show that it performs comparably to or,
more often, better than state-of-the-art methods on both synthetic and
real-world data sets.

Keywords: large-scale optimization, parallel optimization, stochastic 
coordinate descent, convex-concave saddle point problems 


1 Introduction 

The generic convex-concave saddle point problem is written as

$$\min_{x \in \mathbb{R}^d} \max_{y \in \mathbb{R}^q} \left\{ L(x, y) = g(x) + \langle x, Ky \rangle - \phi^*(y) \right\}, \qquad (1)$$

where $g(x)$ is a proper convex function, $\phi^*(\cdot)$ is the convex conjugate of a convex
function $\phi(\cdot)$, and $K \in \mathbb{R}^{d \times q}$ is a matrix. Many machine learning tasks reduce to
solving a problem of this form [6,13]. As a result, this saddle point problem has been
widely studied [16,14,2,1,4,5].

One important subclass of the general convex-concave saddle point problem
is where $g(x)$ or $\phi^*(y)$ exhibits an additively separable structure. We say $\phi^*(y)$
is separable when $\phi^*(y) = \frac{1}{n}\sum_{i=1}^n \phi_i^*(y_i)$, with $y_i \in \mathbb{R}^{q_i}$ and $\sum_{i=1}^n q_i = q$.
Separability for $g(\cdot)$ is defined likewise. To keep the notation consistent with the
machine learning applications discussed later, we introduce a matrix $A$ and let
$K = \frac{1}{n}A$. We then partition $A$ into n column blocks $A_i \in \mathbb{R}^{d \times q_i}$, $i = 1, \dots, n$,
so that $Ky = \frac{1}{n}\sum_{i=1}^n A_i y_i$, resulting in a problem of the form

$$\min_{x \in \mathbb{R}^d} \max_{y \in \mathbb{R}^q} \left\{ L(x, y) = g(x) + \frac{1}{n}\sum_{i=1}^n \left( \langle x, A_i y_i \rangle - \phi_i^*(y_i) \right) \right\} \qquad (2)$$

for $\phi^*(\cdot)$ separable. We call any problem of the form (1) where $g(\cdot)$ or $\phi^*(\cdot)$
has separable structure a Separable Convex-Concave Saddle Point (Sep-CCSP)
problem. Eq. (2) gives the explicit form for when $\phi^*(\cdot)$ is separable.

In this work, we further assume that each $\phi_i^*(y_i)$ is $\gamma$-strongly convex, and
$g(x)$ is $\lambda$-strongly convex, i.e.,

$$\phi_i^*(y_i') \ge \phi_i^*(y_i) + \nabla\phi_i^*(y_i)^T (y_i' - y_i) + \frac{\gamma}{2}\|y_i' - y_i\|_2^2, \quad \forall y_i, y_i' \in \mathbb{R}^{q_i},$$

$$g(x') \ge g(x) + \nabla g(x)^T (x' - x) + \frac{\lambda}{2}\|x' - x\|_2^2, \quad \forall x, x' \in \mathbb{R}^d,$$

where we use $\nabla$ to denote both the gradient of a smooth function and a subgradient
of a non-smooth function. When strong convexity is not satisfied, a small
strongly convex perturbation can be added to make the problem satisfy the
assumption m- 

One important instantiation of the Sep-CCSP problem in machine learning
is the regularized empirical risk minimization (ERM, [3]) of linear predictors,

$$\min_{x \in \mathbb{R}^d} \left\{ J(x) = \frac{1}{n}\sum_{i=1}^n \phi_i(a_i^T x) + g(x) \right\}, \qquad (3)$$

where $a_1, \dots, a_n \in \mathbb{R}^d$ are the feature vectors of n data samples, $\phi_i(\cdot)$ is
the convex loss function w.r.t. the linear predictor $a_i^T x$, and $g(x)$ is a
convex regularization term. Many practical classification and regression models
fall into this regularized ERM formulation, such as the linear support vector
machine (SVM), regularized logistic regression and ridge regression; see [3] for
more details.

Reformulating the above regularized ERM by employing the conjugate dual of
the function $\phi_i(\cdot)$, i.e.,

$$\phi_i(a_i^T x) = \max_{y_i \in \mathbb{R}} \; \langle x, y_i a_i \rangle - \phi_i^*(y_i), \qquad (4)$$

leads directly to the following Sep-CCSP problem

$$\min_{x \in \mathbb{R}^d} \max_{y \in \mathbb{R}^n} \; g(x) + \frac{1}{n}\sum_{i=1}^n \left( \langle x, y_i a_i \rangle - \phi_i^*(y_i) \right). \qquad (5)$$

Comparing with the general form, we note that the matrix $A_i$ in (2) is now a
vector $a_i$. For solving the general saddle point problem (1), many primal-dual
algorithms can be applied, such as [16,2,1,14,5]. In addition, the saddle point
problem we consider can also be formulated as a composite function minimization
problem and then solved by Alternating Direction Method of Multipliers
(ADMM) methods [Sj. 

To handle the Sep-CCSP problem, particularly for the regularized ERM problem
(3), Zhang and Xiao [15] proposed a stochastic primal-dual coordinate descent
(SPDC) method. SPDC applies the stochastic coordinate descent method [8,10,11]
within the primal-dual framework, where in each iteration a random subset of dual





Adaptive Stochastic Primal-Dual Coordinate Descent 




coordinates is updated. This method inherits the efficiency of stochastic
coordinate descent for solving large-scale problems. However, SPDC uses a conservative
constant stepsize during the primal-dual updates, which leads to an unsatisfactory
convergence rate, especially for unnormalized data.

In this work, we propose an adaptive stochastic primal-dual coordinate
descent (AdaSPDC) method for solving the Sep-CCSP problem (2), which is a
non-trivial extension of SPDC. By carefully exploiting the structure of each
individual subproblem, we propose an adaptive stepsize rule for both primal and dual
updates according to the chosen subset of coordinate blocks in each iteration.
Both theoretically and empirically, we show that AdaSPDC can yield
significantly better convergence performance than SPDC and other state-of-the-art
methods.

The remainder of the paper is organized as follows. Section 2 summarizes the
general primal-dual framework on which our method and SPDC are based. We then
describe our method AdaSPDC in Section 3, where both the theoretical result
and its comparison with SPDC are provided. In Section 4 we apply our method
to regularized ERM tasks, experiment with both synthetic and real-world
datasets, and show empirically the superiority of AdaSPDC over other competitive
methods. Finally, Section 5 concludes the work.

2 Primal-dual Framework for Convex-Concave Saddle 
Point Problems 

Chambolle and Pock [1] proposed a first-order primal-dual method for the CCSP
problem (1). We refer to this algorithm as PDCP. The update of PDCP in the
(t+1)-th iteration is as follows:

$$y^{t+1} = \arg\min_y \; \phi^*(y) - \langle \bar{x}^t, Ky \rangle + \frac{1}{2\sigma}\|y - y^t\|_2^2 \qquad (6)$$

$$x^{t+1} = \arg\min_x \; g(x) + \langle x, Ky^{t+1} \rangle + \frac{1}{2\tau}\|x - x^t\|_2^2 \qquad (7)$$

$$\bar{x}^{t+1} = x^{t+1} + \theta (x^{t+1} - x^t). \qquad (8)$$

When the parameter configuration satisfies $\tau\sigma < 1/\|K\|^2$ and $\theta = 1$, PDCP
achieves an $O(1/T)$ convergence rate for general convex functions $\phi^*(\cdot)$ and $g(\cdot)$,
where T is the total number of iterations. When $\phi^*(\cdot)$ and $g(\cdot)$ are both strongly
convex, a linear convergence rate can be achieved by using a more carefully scheduled
stepsize. PDCP is a non-stochastic batch method, i.e., it has to update
all the dual coordinates in each iteration for the Sep-CCSP problem, which is
computationally intensive for large-scale (high-dimensional) problems.
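As an illustration of the batch updates (6)-(8), the following minimal sketch runs PDCP on a small ridge regression saddle problem, for which both proximal steps have closed forms. The choices here are assumptions made for the example only: the function name `pdcp_ridge`, the quadratic $g(x) = \frac{\lambda}{2}\|x\|_2^2$, the conjugate $\phi^*(y) = \frac{1}{n}\sum_i (\frac{1}{2}y_i^2 + b_i y_i)$, and equal primal and dual stepsizes chosen so that $\tau\sigma\|K\|^2 < 1$.

```python
import numpy as np


def pdcp_ridge(A, b, lam, num_iters=2000):
    """Run the batch updates (6)-(8) on a ridge regression saddle problem.

    Illustrative assumptions: A has shape (d, n) with one sample per column,
    K = A / n, g(x) = lam / 2 * ||x||^2 and
    phi*(y) = (1 / n) * sum_i (y_i^2 / 2 + b_i * y_i).
    """
    d, n = A.shape
    K = A / n
    # Any pair with tau * sigma * ||K||^2 < 1 is admissible; equal steps are
    # an arbitrary choice made for this sketch.
    tau = sigma = 0.9 / np.linalg.norm(K, 2)
    theta = 1.0
    x = np.zeros(d)
    x_bar = x.copy()
    y = np.zeros(n)
    for _ in range(num_iters):
        # (6): the proximal step on phi* has a closed form for quadratics.
        y = (y / sigma + (A.T @ x_bar - b) / n) / (1.0 / sigma + 1.0 / n)
        # (7): proximal step on g using the new dual iterate.
        x_new = (x / tau - K @ y) / (lam + 1.0 / tau)
        # (8): extrapolation of the primal iterate.
        x_bar = x_new + theta * (x_new - x)
        x = x_new
    return x
```

Note that every iteration touches all n dual coordinates through `A.T @ x_bar`, which is the per-iteration cost that the stochastic variants discussed next are designed to avoid.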

SPDC [15] can be viewed as a stochastic variant of the batch method PDCP
for handling the Sep-CCSP problem. However, SPDC uses a conservative constant
stepsize for the primal and dual updates. Neither PDCP nor SPDC considers
the structure of the matrix K; both apply a constant stepsize to all coordinates
of the primal and dual variables. This may limit their convergence performance in
practice.

Based on this observation, we exploit the structure of the matrix K (i.e., $\frac{1}{n}A$)
and propose an adaptive stepsize rule for efficiently solving the Sep-CCSP problem.
A better linear convergence rate can be obtained when $\phi_i^*(\cdot)$ and $g(\cdot)$ are strongly
convex. Our algorithm is elaborated in the following section.


3 Adaptive Stochastic Primal-Dual Coordinate Descent 


As a non-trivial extension of SPDC [15], our method AdaSPDC solves the
Sep-CCSP problem (2) by using an adaptive parameter configuration. Concretely,
we optimize $L(x, y)$ by alternately updating the dual and primal variables in
a principled way. Thanks to the separable structure of $\phi^*(y)$, in each iteration we
can randomly select m blocks of dual variables, whose indices are denoted by $S_t$,
and then update only these selected blocks in the following way,

$$y_i^{t+1} = \arg\min_{y_i} \; \phi_i^*(y_i) - \langle \bar{x}^t, A_i y_i \rangle + \frac{1}{2\sigma_i}\|y_i - y_i^t\|_2^2, \quad \text{if } i \in S_t. \qquad (9)$$

For coordinates in blocks that are not selected, $i \notin S_t$, we simply keep $y_i^{t+1} = y_i^t$. By
exploiting the structure of each individual $A_i$, we configure the stepsize parameter of
the proximal term $\sigma_i$ adaptively,



$$\sigma_i = \frac{1}{2R_i}\sqrt{\frac{n\lambda}{m\gamma}}, \qquad (10)$$

where $R_i = \|A_i\|_2 = \sqrt{\mu_{\max}(A_i^T A_i)}$, with $\|\cdot\|_2$ the spectral norm of a matrix
and $\mu_{\max}(\cdot)$ denoting the maximum singular value of a matrix.

Our step size differs from the one used in SPDC [15], where $R_i$ is replaced by the
constant $R = \max\{\|a_i\|_2 : i = 1, \dots, n\}$ (since SPDC only considers the ERM
problem, the matrix $A_i$ is a feature vector $a_i$).
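To make the difference between a per-block $R_i$ and a single global $R$ concrete, the sketch below computes the spectral norm of each column block and the resulting dual stepsizes. It assumes the SPDC-style rule $\sigma = \frac{1}{2R}\sqrt{n\lambda/(m\gamma)}$ with $R$ replaced by $R_i$; the helper names and the synthetic blocks are hypothetical and only serve the illustration.

```python
import numpy as np


def block_spectral_norms(blocks):
    """R_i = ||A_i||_2, the largest singular value of each (d, q_i) block."""
    return np.array([np.linalg.norm(A_i, 2) for A_i in blocks])


def dual_stepsizes(R, n, m, lam, gamma):
    """Assumed SPDC-style rule: sqrt(n * lam / (m * gamma)) / (2 * R)."""
    return np.sqrt(n * lam / (m * gamma)) / (2.0 * np.asarray(R, dtype=float))


rng = np.random.default_rng(0)
d, n, m, lam, gamma = 20, 8, 2, 1e-2, 1.0
# Hypothetical unnormalized data: block scales spread over two decades.
blocks = [s * rng.standard_normal((d, 3)) for s in np.logspace(-1, 1, n)]
R = block_spectral_norms(blocks)
sigma_adaptive = dual_stepsizes(R, n, m, lam, gamma)        # one value per block
sigma_constant = dual_stepsizes(R.max(), n, m, lam, gamma)  # single global value
```

Under this rule every adaptive stepsize is at least as large as the constant one, and the gap grows with the spread of the block norms, which is the situation of unnormalized data.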

Remark. Intuitively, $R_i$ in AdaSPDC can be understood as the coupling
strength between the i-th dual variable block and the primal variable, measured
by the spectral norm of the matrix $A_i$. A smaller coupling strength allows us to use
a larger stepsize for the current dual variable block without worrying too much about
its influence on the primal variable, and vice versa. Compared with SPDC, our
adaptive coupling strength for the chosen coordinate block directly
results in a larger step size, and thus helps to improve convergence speed.

In the stochastic dual update, we also use an intermediate variable $\bar{x}^t$ as in
the PDCP algorithm, and we describe its update later.

Since we assume $g(x)$ is not separable, we update the primal variable as a
whole,


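$$x^{t+1} = \arg\min_x \; g(x) + \Big\langle x, \; r^t + \frac{1}{m}\sum_{j \in S_t} A_j (y_j^{t+1} - y_j^t) \Big\rangle + \frac{1}{2\tau^t}\|x - x^t\|_2^2. \qquad (11)$$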
Algorithm 1 AdaSPDC for Separable Convex-Concave Saddle Point Problems

 1: Input: number of blocks picked in each iteration m and number of iterations T.
 2: Initialize: $x^0$, $y^0$, $\bar{x}^0 = x^0$, $r^0 = \frac{1}{n}\sum_{i=1}^n A_i y_i^0$
 3: for t = 0, 1, ..., T - 1 do
 4:   Randomly pick m dual coordinate blocks from {1, ..., n} as the index set $S_t$, with
      the probability of each block being selected equal to m/n.
 5:   According to the selected subset $S_t$, compute the adaptive parameter configuration
      of $\sigma_i$, $\tau^t$ and $\theta^t$ using Eq. (10), (12) and (15), respectively.
 6:   for each selected block in parallel do
 7:     Update the dual variable block using Eq. (9).
 8:   end for
 9:   Update the primal variable using Eq. (11).
10:   Extrapolate the primal variable using Eq. (14).
11:   Update the auxiliary variable r using Eq. (13).
12: end for


The proximal parameter $\tau^t$ is also configured adaptively,

$$\tau^t = \frac{1}{2R_{\max}}\sqrt{\frac{m\gamma}{n\lambda}}, \qquad (12)$$

where $R_{\max} = \max\{R_i \mid i \in S_t\}$, compared with the constant R used in SPDC. To
account for the incremental change after the latest dual update, an auxiliary
variable $r^t = \frac{1}{n}\sum_{i=1}^n A_i y_i^t$ is updated as follows,

$$r^{t+1} = r^t + \frac{1}{n}\sum_{j \in S_t} A_j (y_j^{t+1} - y_j^t). \qquad (13)$$

Finally, we update the intermediate variable $\bar{x}$, which implements an extrapolation
step over the current $x^{t+1}$ and can help to provide a faster convergence rate

m- 

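$$\bar{x}^{t+1} = x^{t+1} + \theta^t (x^{t+1} - x^t), \qquad (14)$$

where $\theta^t$ is configured adaptively as

$$\theta^t = 1 - \frac{1}{n/m + R_{\max}\sqrt{(n/m)/(\lambda\gamma)}}, \qquad (15)$$

in contrast to the constant $\theta$ used in SPDC.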

The whole procedure for solving the Sep-CCSP problem (2) using AdaSPDC
is summarized in Algorithm 1. There are several notable characteristics of our
algorithm.

- Compared with SPDC, our method uses an adaptive step size to obtain faster
convergence (shown in Theorem 1), while the algorithm does
not incur any extra computational complexity. As demonstrated in the
experiments in Section 4, in many cases AdaSPDC provides significantly better
performance than SPDC.
- Since, in each iteration, a number of block coordinates can be chosen and
updated independently (with independent evaluation of the individual step sizes),
the method directly enables parallel processing, and hence use on modern computing
clusters. The ability to select an arbitrary number of blocks helps to make
use of all the available computational resources as effectively as possible.
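The steps of Algorithm 1 can be sketched for ridge regression, where each dual block is a scalar and both updates have closed forms. This is a minimal illustration rather than a reference implementation: the function name `adaspdc_ridge` is hypothetical, the stepsize and extrapolation rules are assumed to be the SPDC-style expressions with $R$ replaced by $R_i$ and $R_{\max}$, the squared-loss conjugate is taken as $\phi_i^*(y_i) = \frac{1}{2}y_i^2 + b_i y_i$ (so $\gamma = 1$), and all columns of A are assumed non-zero.

```python
import numpy as np


def adaspdc_ridge(A, b, lam, m=1, num_iters=5000, seed=0):
    """Sketch of Algorithm 1 for ridge regression with scalar dual blocks.

    A has shape (d, n) with one sample a_i per column (assumed non-zero).
    """
    rng = np.random.default_rng(seed)
    d, n = A.shape
    gamma = 1.0                               # strong convexity of phi_i*
    R = np.linalg.norm(A, axis=0)             # R_i = ||a_i||_2
    x = np.zeros(d)
    x_bar = x.copy()
    y = np.zeros(n)
    r = A @ y / n                             # r^0 = (1/n) * sum_i a_i * y_i^0
    for _ in range(num_iters):
        # Sample m distinct blocks; each index is picked with probability m/n.
        S = rng.choice(n, size=m, replace=False)
        R_max = R[S].max()
        # Assumed adaptive configuration of sigma_i, tau^t and theta^t.
        sigma = np.sqrt(n * lam / (m * gamma)) / (2.0 * R[S])
        tau = np.sqrt(m * gamma / (n * lam)) / (2.0 * R_max)
        theta = 1.0 - 1.0 / (n / m + R_max * np.sqrt((n / m) / (lam * gamma)))
        # Dual update on the selected blocks (closed form for squared loss).
        y_new = (A[:, S].T @ x_bar - b[S] + y[S] / sigma) / (1.0 + 1.0 / sigma)
        delta = A[:, S] @ (y_new - y[S])
        # Primal update using r^t plus the rescaled change of the dual blocks.
        x_new = (x / tau - (r + delta / m)) / (lam + 1.0 / tau)
        # Extrapolation, then bookkeeping of the auxiliary variable r.
        x_bar = x_new + theta * (x_new - x)
        r = r + delta / n
        y[S] = y_new
        x = x_new
    return x
```

The dual updates inside one iteration are independent of each other, so the vectorized expression for `y_new` stands in for the parallel loop of the algorithm.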


3.1 Convergence Analysis 

We characterise the convergence performance of our method in the following
theorem.

Theorem 1. Assume that each $\phi_i^*(\cdot)$ is $\gamma$-strongly convex and $g(\cdot)$ is $\lambda$-strongly
convex. Given the parameter configuration in Eq. (10), (12) and (15),
after T iterations of Algorithm 1 the algorithm achieves the following convergence
performance



where (x*,y*) is the optimal saddle point, Vi = jy' = V(2£t)±j_^ 

l|y^-y*ll^ = Er=i'^dlyf-y*lll- 


Since the proof of the above is technical, we provide it in the Supplementary
Material.

In our proof, given the proposed parameter $\theta^t$, the critical point for obtaining
a sharper linear convergence rate than SPDC is that we configure $\tau^t$ and $\sigma_i$ as in
Eq. (12) and (10) to guarantee the positive definiteness of the following matrix
in the t-th iteration,

$$P = \begin{bmatrix} \frac{m}{2\tau^t} I & -A_{S_t} \\ -A_{S_t}^T & \frac{1}{2}\,\mathrm{diag}(\sigma_{S_t})^{-1} \end{bmatrix}, \qquad (17)$$

where $A_{S_t} = [\dots, A_i, \dots]$ collects the selected column blocks and $\mathrm{diag}(\sigma_{S_t}) = \mathrm{diag}(\dots, \sigma_i I_{q_i}, \dots)$ for $i \in S_t$.
However, we found that the parameter configuration that guarantees the positive
definiteness of P is not unique, and there exist other valid parameter configurations
besides the one proposed in this work. We leave further investigation
of other potential parameter configurations as future work.


3.2 Further Comparison with SPDC

Compared with SPDC [15], AdaSPDC follows a similar primal-dual framework.
The crucial difference between them is that AdaSPDC uses a larger
stepsize for both the dual and primal updates, see Eq. (10) and (12), compared with
SPDC's parameter configuration given in Eq. (10) in [15], where SPDC applies
a large constant $R = \max\{\|a_i\|_2 : i = 1, \dots, n\}$ while AdaSPDC uses the more
adaptive values $R_i$ and $R_{\max}$ in the t-th iteration to account for the different coupling
strengths between the selected dual coordinate blocks and the primal variable.
This difference means that AdaSPDC can potentially obtain a significantly
sharper linear convergence rate than SPDC, since the decay factor $\theta^t$
of AdaSPDC is no larger than $\theta$ in SPDC (Eq. (10) in [15]); compare Theorem 1 for
AdaSPDC with Theorem 1 in [15] for SPDC. The empirical performance
of the two algorithms is demonstrated in the experimental Section 4.

To mitigate the problem that SPDC uses a large R, the authors of SPDC
propose to sample the dual coordinates to update in each iteration
non-uniformly, according to the norm of each $a_i$. However, as we show later in the
empirical experiments, this non-uniform sampling does not work very well for some
datasets. By configuring the adaptive stepsize explicitly, our method AdaSPDC
provides a better solution for unnormalized data than SPDC; see
Section 4 for more empirical evidence.

Another difference is that SPDC only considers the regularized ERM task,
i.e., it only handles the case where each $A_i$ is a feature vector $a_i$, while AdaSPDC
allows $A_i$ to be a matrix, so that AdaSPDC can cover a wider range of
applications than SPDC: in each iteration a number of block coordinates
can be selected, whereas SPDC only allows a number of individual coordinates.



Fig. 1. Ridge regression with synthetic data: comparison of convergence performance
w.r.t. the number of passes, with one panel, (a) to (d), per value of the regularization
parameter $\lambda$. Problem size: d = 1000, n = 1000. We evaluate the convergence
performance using the objective suboptimality, $J(x^t) - J(x^*)$.
4 Empirical Results

In this section, we apply AdaSPDC to several regularized empirical risk
minimization problems. The experiments compare our method
AdaSPDC with other competitive stochastic optimization methods, including
SDCA [13], SAG [12], and SPDC with uniform sampling and non-uniform sampling
[15]. To provide a fair comparison with these methods, in each iteration
only one dual coordinate (or data instance) is chosen, i.e., we run all the
methods sequentially. To obtain results that are independent of the practical
implementation of the algorithms, we measure performance in terms
of objective suboptimality w.r.t. the number of effective passes over the entire data set.

Each experiment is run 10 times and the average results are reported to show
statistical consistency. We present all the experimental results we have obtained for
each application.


4.1 Ridge Regression

We first apply our method AdaSPDC to a simple ridge regression problem
with synthetic data. The data are generated in the same way as in Zhang and Xiao
[15]; n = 1000 i.i.d. training points are generated in the following
manner,

$$b = a^T x^* + \varepsilon, \quad a \sim \mathcal{N}(0, \Sigma), \quad \varepsilon \sim \mathcal{N}(0, 1),$$

where $a \in \mathbb{R}^d$ and d = 1000, and the elements of the vector $x^*$ are all ones. The
covariance matrix $\Sigma$ is set to be diagonal with $\Sigma_{jj} = j^{-2}$, for $j = 1, \dots, d$. Ridge
regression then solves the following optimization problem,

$$\min_{x \in \mathbb{R}^d} \left\{ J(x) = \frac{1}{n}\sum_{i=1}^n \frac{1}{2}(a_i^T x - b_i)^2 + \frac{\lambda}{2}\|x\|_2^2 \right\}. \qquad (18)$$

The optimal solution of the above ridge regression can be found as

$$x^* = (AA^T + n\lambda I_d)^{-1} A b.$$
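The synthetic setup and the closed-form minimizer can be sketched as follows. This is a minimal illustration assuming samples are stored as columns of A, a diagonal covariance with $\Sigma_{jj} = j^{-2}$, and an all-ones generating vector; the function names are hypothetical. The closed form is what makes the objective suboptimality $J(x^t) - J(x^*)$ cheap to evaluate in this experiment.

```python
import numpy as np


def make_ridge_data(n=1000, d=1000, seed=0):
    """Draw columns a_i ~ N(0, Sigma) with Sigma_jj = j^-2 and noisy targets."""
    rng = np.random.default_rng(seed)
    x_true = np.ones(d)
    std = 1.0 / np.arange(1, d + 1)                   # sqrt(Sigma_jj)
    A = std[:, None] * rng.standard_normal((d, n))    # shape (d, n)
    b = A.T @ x_true + rng.standard_normal(n)
    return A, b


def ridge_closed_form(A, b, lam):
    """Minimizer (A A^T + n * lam * I)^-1 A b of the ridge objective."""
    d, n = A.shape
    return np.linalg.solve(A @ A.T + n * lam * np.eye(d), A @ b)


def ridge_objective(A, b, lam, x):
    """J(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2 + lam / 2 * ||x||^2."""
    return 0.5 * np.mean((A.T @ x - b) ** 2) + 0.5 * lam * (x @ x)
```

For an iterate `x_t`, the suboptimality is then `ridge_objective(A, b, lam, x_t) - ridge_objective(A, b, lam, ridge_closed_form(A, b, lam))`.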

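By employing the conjugate dual of the quadratic loss (cf. Eq. (4)), we can
reformulate ridge regression as the following Sep-CCSP problem,

$$\min_{x \in \mathbb{R}^d} \max_{y \in \mathbb{R}^n} \; \frac{\lambda}{2}\|x\|_2^2 + \frac{1}{n}\sum_{i=1}^n \left( \langle x, y_i a_i \rangle - \Big( \frac{1}{2}y_i^2 + b_i y_i \Big) \right). \qquad (19)$$

It is easy to verify that $g(x) = \frac{\lambda}{2}\|x\|_2^2$ is $\lambda$-strongly convex, and $\phi_i^*(y_i) = \frac{1}{2}y_i^2 + b_i y_i$ is 1-strongly convex.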
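Thus, for ridge regression, the dual update in Eq. (9) and the primal update in
Eq. (11) of AdaSPDC have the closed-form solutions below.

$$y_i^{t+1} = \frac{1}{1 + 1/\sigma_i}\left( \langle \bar{x}^t, a_i \rangle - b_i + \frac{1}{\sigma_i} y_i^t \right), \quad \text{if } i \in S_t,$$

$$x^{t+1} = \frac{1}{\lambda + 1/\tau^t}\left( \frac{1}{\tau^t} x^t - \Big( r^t + \frac{1}{m}\sum_{j \in S_t} a_j (y_j^{t+1} - y_j^t) \Big) \right).$$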


The algorithm performance is evaluated in terms of objective suboptimality
(measured by $J(x^t) - J(x^*)$) w.r.t. the number of effective passes over the entire
dataset. Several values of the regularization parameter $\lambda$ are used
to demonstrate algorithm performance under different degrees of ill-conditioning,
A = {10-^10-'‘,10-^10-®}. 

Fig. 1 shows algorithm performance with different degrees of regularization.
AdaSPDC converges substantially faster than the other
compared methods, particularly for ill-conditioned problems. Compared with
SPDC and its variant with non-uniform sampling, the use of an adaptive stepsize
in AdaSPDC significantly improves convergence speed. For instance, in the case
with A = 10“®, AdaSPDC achieves 100 times better suboptimality than both 
SPDC and its variant SPDC with non-uniform sampling after 300 passes. 


Table 1. Benchmark datasets used in our experiments for binary classification.

Datasets    Number of samples   Number of features   Sparsity
w8a         49,749              300                  3.9%
covertype   20,242              47,236               0.16%
url         2,396,130           3,231,961            0.0018%
quantum     50,000              78                   43.44%
protein     145,751             74                   99.21%


4.2 Binary Classification on Real-world Datasets

We now compare the performance of our method AdaSPDC with other competitive
methods on several real-world data sets. Our experiments focus on
freely available benchmark data sets for binary classification, whose details
are listed in Table 1. The w8a, covertype and url data are obtained
from the LIBSVM collection [1]. The quantum and protein data sets are obtained
from KDD Cup 2004 [2]. For all the datasets, each sample takes the form $(a_i, b_i)$,
where $a_i$ is the feature vector and $b_i$ is the binary label -1 or 1. We add a bias term

[1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html
[2] http://osmot.cs.Cornell.edu/kddcup/datasets.html
Fig. 2. Comparison of algorithm performance with smooth Hinge loss.


to the feature vector for all the datasets. We aim to minimize the regularized
empirical risk of the following form,

$$J(x) = \frac{1}{n}\sum_{i=1}^n \phi_i(a_i^T x) + \frac{\lambda}{2}\|x\|_2^2. \qquad (20)$$

Fig. 3. Comparison of algorithm performance with Logistic loss.


To provide a more comprehensive comparison between these methods, we experiment
with two different loss functions, the smooth Hinge loss [13] and the logistic
loss, described in the following.
Smooth Hinge loss (with smoothing parameter $\gamma = 1$).

$$\phi_i(z) = \begin{cases} 0 & \text{if } b_i z \ge 1, \\ 1 - \frac{\gamma}{2} - b_i z & \text{if } b_i z \le 1 - \gamma, \\ \frac{1}{2\gamma}(1 - b_i z)^2 & \text{otherwise.} \end{cases}$$

Its conjugate dual is

$$\phi_i^*(y_i) = b_i y_i + \frac{\gamma}{2} y_i^2, \quad \text{with } b_i y_i \in [-1, 0].$$

We can observe that $\phi_i^*(y_i)$ is $\gamma$-strongly convex with $\gamma = 1$. The dual update of
AdaSPDC for the smooth Hinge loss is nearly the same as for ridge regression, except
for the necessary projection onto the interval $b_i y_i \in [-1, 0]$.
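The projected dual update can be sketched as below. The conjugate $\phi_i^*(y_i) = b_i y_i + \frac{\gamma}{2}y_i^2$ on $b_i y_i \in [-1, 0]$ is assumed here (it follows from conjugating the loss above), labels are assumed to be in {-1, +1}, and the function names are hypothetical. Because the one-dimensional subproblem is a strongly convex quadratic, clipping the unconstrained minimizer onto the interval gives the constrained minimizer.

```python
import numpy as np


def smooth_hinge(z, b, gamma=1.0):
    """Smooth Hinge loss of the margin b * z."""
    u = b * z
    return np.where(u >= 1.0, 0.0,
                    np.where(u <= 1.0 - gamma,
                             1.0 - gamma / 2.0 - u,
                             (1.0 - u) ** 2 / (2.0 * gamma)))


def smooth_hinge_conjugate(y, b, gamma=1.0):
    """phi*(y) = b * y + gamma / 2 * y^2, valid for b * y in [-1, 0]."""
    return b * y + 0.5 * gamma * y ** 2


def smooth_hinge_dual_update(y_old, b, a_dot_xbar, sigma, gamma=1.0):
    """Closed-form proximal dual step followed by projection.

    a_dot_xbar is the inner product <x_bar^t, a_i>; b must be -1 or +1.
    """
    y = (a_dot_xbar - b + y_old / sigma) / (gamma + 1.0 / sigma)
    # Project so that b * y lies in [-1, 0]; since b^2 = 1, y = b * (b * y).
    return b * np.clip(b * y, -1.0, 0.0)
```

With $\gamma = 1$ the unconstrained step coincides with the ridge regression update, so only the final clipping differs.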


Logistic loss

$$\phi_i(z) = \log\left(1 + \exp(-b_i z)\right),$$

whose conjugate dual has the form

$$\phi_i^*(y_i) = -b_i y_i \log(-b_i y_i) + (1 + b_i y_i)\log(1 + b_i y_i), \quad \text{with } b_i y_i \in [-1, 0].$$

It is also easy to show that $\phi_i^*(y_i)$ is $\gamma$-strongly convex with $\gamma = 4$. Note that
for the logistic loss, the dual update in Eq. (9) does not have a closed-form solution;
we can start from some initial solution and apply several steps of
Newton's update to obtain a more accurate solution.
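One possible way to carry out these Newton steps is sketched below. It works in the scalar variable $v = b_i y_i \in (-1, 0)$ and, as a precaution that is our own assumption rather than part of the method described above, falls back to bisection whenever a Newton step would leave the current bracket. The function name and the default number of steps are hypothetical.

```python
import numpy as np


def logistic_dual_update(y_old, b, a_dot_xbar, sigma,
                         num_steps=30, eps=1e-12, tol=1e-12):
    """Safeguarded Newton iterations for the logistic-loss dual update.

    Minimizes phi*(y) - <x_bar, a> * y + (y - y_old)^2 / (2 * sigma) over
    v = b * y in (-1, 0), with b in {-1, +1} and a_dot_xbar = <x_bar, a>.
    """
    v0 = b * y_old
    c = b * a_dot_xbar

    def grad(v):
        # Derivative of the objective with respect to v.
        return np.log1p(v) - np.log(-v) - c + (v - v0) / sigma

    def hess(v):
        return 1.0 / ((1.0 + v) * (-v)) + 1.0 / sigma

    lo, hi = -1.0 + eps, -eps
    # Start from the previous value when it is strictly feasible.
    v = v0 if lo < v0 < hi else -0.5
    for _ in range(num_steps):
        g = grad(v)
        if abs(g) < tol:
            break
        # The gradient is increasing in v, so its sign tells which side to cut.
        if g > 0.0:
            hi = v
        else:
            lo = v
        v_newton = v - g / hess(v)
        v = v_newton if lo < v_newton < hi else 0.5 * (lo + hi)
    return b * v
```

The gradient tends to minus infinity at $v = -1$ and plus infinity at $v = 0$, so a root always exists inside the interval and the bracket remains valid throughout.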

During the experiments, we observe that the performance of SAG is very
sensitive to the stepsize choice. To obtain the best results for SAG, we try different
choices of stepsize in the interval [1/(16L), 1/L] and report the best result for
each dataset, where L is the Lipschitz constant of the gradient of $\phi_i(a_i^T x)$, 1/(16L) is the theoretical
stepsize choice for SAG and 1/L is the suggested empirical choice [12]. For
the smooth Hinge loss, $L = \max_i\{\|a_i\|_2^2, \; i = 1, \dots, n\}$, and for the logistic loss,
$L = \frac{1}{4}\max_i\{\|a_i\|_2^2, \; i = 1, \dots, n\}$.

Fig. 2 and Fig. 3 depict the performance of the different methods
with the smooth Hinge loss and the logistic loss, respectively. We compare all these
methods with different values of A = {10“®, 10“®, 10“^}. Generally, our method 
AdaSPDC performs consistently better than, or at least comparably to, the other
methods, and performs especially well for tasks with a small regularization parameter
$\lambda$. For some datasets, such as covertype and quantum, SPDC with non-uniform
sampling decreases the objective faster than the other methods in early epochs
but cannot achieve comparable results in later epochs,
which might be caused by its conservative stepsize.

5 Conclusion & Future Work 

In this work, we propose Adaptive Stochastic Primal-Dual Coordinate Descent
(AdaSPDC) for separable saddle point problems. As a non-trivial extension of
the recent SPDC method [15], AdaSPDC uses adaptive step sizes for both the
primal and dual updates in each iteration. The design of the step size for
AdaSPDC explicitly and adaptively models the coupling strength
between the chosen block coordinates and the primal variable through the spectral norm
of each $A_i$. We theoretically characterise that AdaSPDC holds a sharper linear
convergence rate than SPDC. Additionally, we demonstrate the superiority of the
proposed AdaSPDC method on ERM problems through extensive experiments
on both synthetic and real-world data sets.

An immediate further research direction is to investigate other valid parameter
configurations for the extrapolation parameter $\theta$ and the primal and dual
step sizes $\tau$ and $\sigma$, both theoretically and empirically. In addition, discovering
potential theoretical connections with other stochastic optimization methods
would also be enlightening.


Acknowledgments. Z. Zhu is supported by China Scholarship Council/University 
of Edinburgh Joint Scholarship. The authors would like to thank Jinli Hu for 
insightful discussion on the proof of Theorem 1. 
Appendix: Proofs

Before presenting the proof of Theorem 1, we first provide the following lemma
and its proof, which characterizes the positive definiteness of an important
matrix used in the proof of Theorem 1.

Lemma 1. Given any matrix $K \in \mathbb{R}^{d \times m}$, we partition K into J column
blocks, $K_j \in \mathbb{R}^{d \times m_j}$, $j = 1, \dots, J$, with $\sum_{j=1}^J m_j = m$, and then define
two diagonal matrices, $U = u I_d \in \mathbb{R}^{d \times d}$ and $V = \mathrm{diag}(v_1 I_{m_1}, v_2 I_{m_2}, \dots, v_J I_{m_J}) \in \mathbb{R}^{m \times m}$,
with $V_j = v_j I_{m_j}$. Denote $R_j = \|K_j\|_2 = \sqrt{\mu_{\max}(K_j^T K_j)}$, where
$\|\cdot\|_2$ is the spectral norm and $\mu_{\max}(\cdot)$ is the maximum singular value of a matrix,
and let $R_{\max} = \max\{R_j \mid j = 1, \dots, J\}$. Now we consider the following parameter
configuration, for any positive constant c > 0,

$$v_j = \frac{c}{R_j}, \qquad (21)$$

$$u = \frac{1}{c J R_{\max}}. \qquad (22)$$

Under the above parameter configuration, the following matrix is positive definite,

$$P = \begin{bmatrix} U^{-1} & -K \\ -K^T & V^{-1} \end{bmatrix} \succ 0. \qquad (23)$$


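Proof. First consider each separable column block $K_j$; then

$$\|U^{1/2} K_j V_j^{1/2}\|_2^2 = u v_j R_j^2 = \frac{R_j}{J R_{\max}} \le \frac{1}{J}. \qquad (24)$$

For any $x \in \mathbb{R}^d$, $y_j \in \mathbb{R}^{m_j}$, we consider

$$-2\langle x, K_j y_j \rangle = -2\langle U^{-1/2} x, \; U^{1/2} K_j V_j^{1/2} V_j^{-1/2} y_j \rangle. \qquad (25)$$

Applying the Cauchy-Schwarz inequality and the fact that $2ab \le a^2/h + h b^2$ for
any a, b and h > 0, we obtain

$$-2\langle x, K_j y_j \rangle \ge -2\|U^{-1/2} x\|_2 \, \|U^{1/2} K_j V_j^{1/2} V_j^{-1/2} y_j\|_2 \qquad (26)$$

$$\ge -\left( \frac{1}{h}\langle x, U^{-1} x \rangle + h \|U^{1/2} K_j V_j^{1/2}\|_2^2 \langle y_j, V_j^{-1} y_j \rangle \right). \qquad (27)$$

In view of the inequality (24), there exists some $\epsilon > 0$ such
that the following equality holds,

$$(J + \epsilon)(1 + \epsilon)\|U^{1/2} K_j V_j^{1/2}\|_2^2 = 1. \qquad (28)$$

Thanks to this equality, we now set $h = J + \epsilon$, and the inequality (27) can be
further simplified,

$$-2\langle x, K_j y_j \rangle \ge -\left( \frac{1}{J + \epsilon}\langle x, U^{-1} x \rangle + \frac{(J + \epsilon)\|U^{1/2} K_j V_j^{1/2}\|_2^2}{(J + \epsilon)(1 + \epsilon)\|U^{1/2} K_j V_j^{1/2}\|_2^2}\langle y_j, V_j^{-1} y_j \rangle \right) \qquad (29)$$

$$= -\left( \frac{1}{J + \epsilon}\langle x, U^{-1} x \rangle + \frac{1}{1 + \epsilon}\langle y_j, V_j^{-1} y_j \rangle \right). \qquad (30)$$

Let $y = (y_1, \dots, y_J) \in \mathbb{R}^m$. For any non-zero $(x, y) \in \mathbb{R}^{d + m}$,
the following inner product can be expanded as

$$\langle (x, y), P (x, y) \rangle = \langle x, U^{-1} x \rangle + \sum_{j=1}^J \langle y_j, V_j^{-1} y_j \rangle - 2\sum_{j=1}^J \langle x, K_j y_j \rangle. \qquad (31)$$

Inserting the inequality (30) into the above equation, we obtain

$$\langle (x, y), P (x, y) \rangle \ge \langle x, U^{-1} x \rangle + \sum_{j=1}^J \langle y_j, V_j^{-1} y_j \rangle - \sum_{j=1}^J \left( \frac{1}{J + \epsilon}\langle x, U^{-1} x \rangle + \frac{1}{1 + \epsilon}\langle y_j, V_j^{-1} y_j \rangle \right) \qquad (32)$$

$$= \frac{\epsilon}{J + \epsilon}\langle x, U^{-1} x \rangle + \frac{\epsilon}{1 + \epsilon}\sum_{j=1}^J \langle y_j, V_j^{-1} y_j \rangle \qquad (33)$$

$$> 0, \qquad (34)$$

which guarantees the positive definiteness of the matrix P.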
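The statement of Lemma 1 can be checked numerically on random data. The sketch below assembles P for the configuration $v_j = c / R_j$ and $u = 1/(c J R_{\max})$, which is our reading of the parameter choice in the lemma, and inspects its smallest eigenvalue. The function name and the random blocks are hypothetical; such a check illustrates the lemma on generic instances and is not a substitute for the proof.

```python
import numpy as np


def lemma_matrix(blocks, c):
    """Assemble P = [[U^-1, -K], [-K^T, V^-1]] for column blocks K_j.

    Assumed configuration: u = 1 / (c * J * R_max) and v_j = c / R_j,
    where R_j is the spectral norm of the (d, m_j) block K_j.
    """
    d = blocks[0].shape[0]
    J = len(blocks)
    R = np.array([np.linalg.norm(K_j, 2) for K_j in blocks])
    K = np.hstack(blocks)
    u_inv = c * J * R.max()
    # Diagonal of V^-1: the value R_j / c repeated m_j times for each block.
    v_inv = np.concatenate(
        [np.full(K_j.shape[1], R_j / c) for K_j, R_j in zip(blocks, R)]
    )
    return np.block([[u_inv * np.eye(d), -K], [-K.T, np.diag(v_inv)]])


rng = np.random.default_rng(0)
example_blocks = [rng.standard_normal((6, q)) for q in (1, 2, 3, 2)]
smallest_eigenvalues = [
    np.linalg.eigvalsh(lemma_matrix(example_blocks, c)).min()
    for c in (0.1, 1.0, 10.0)
]
```

For these random blocks the smallest eigenvalue is positive for every tested value of c, in line with the claim that the configuration works for any positive constant.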

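We are now ready to prove Theorem 1.

Proof. First, we analyze the value of the dual variable y after the t-th update in
Algorithm 1. For any $i \in \{1, 2, \dots, n\}$, let $\tilde{y}_i$ be the value of $y_i^{t+1}$ if $i \in S_t$, i.e.,

$$\tilde{y}_i = \arg\min_{y_i} \; \phi_i^*(y_i) - \langle \bar{x}^t, A_i y_i \rangle + \frac{1}{2\sigma_i}\|y_i - y_i^t\|_2^2. \qquad (35)$$

Since $\phi_i^*(\cdot)$ is $\gamma$-strongly convex, the function to be minimized above is
$(1/\sigma_i + \gamma)$-strongly convex. Then we have

$$\phi_i^*(y_i^*) - \langle \bar{x}^t, A_i y_i^* \rangle + \frac{1}{2\sigma_i}\|y_i^* - y_i^t\|_2^2 \ge \phi_i^*(\tilde{y}_i) - \langle \bar{x}^t, A_i \tilde{y}_i \rangle + \frac{1}{2\sigma_i}\|\tilde{y}_i - y_i^t\|_2^2 + \left( \frac{1}{2\sigma_i} + \frac{\gamma}{2} \right)\|\tilde{y}_i - y_i^*\|_2^2. \qquad (36)$$

Since $(x^*, y^*)$ is the saddle point and $\phi_i^*(\cdot)$ is $\gamma$-strongly convex, we obtain the following inequality,

$$\phi_i^*(\tilde{y}_i) - \langle x^*, A_i \tilde{y}_i \rangle \ge \phi_i^*(y_i^*) - \langle x^*, A_i y_i^* \rangle + \frac{\gamma}{2}\|\tilde{y}_i - y_i^*\|_2^2. \qquad (37)$$

Adding the two inequalities together, we have

$$\frac{1}{2\sigma_i}\|y_i^t - y_i^*\|_2^2 \ge \left( \frac{1}{2\sigma_i} + \gamma \right)\|\tilde{y}_i - y_i^*\|_2^2 + \frac{1}{2\sigma_i}\|\tilde{y}_i - y_i^t\|_2^2 + \langle x^* - \bar{x}^t, A_i(\tilde{y}_i - y_i^*) \rangle. \qquad (38)$$

In our algorithm, an index set $S_t$ is randomly chosen. For every specific index
i, the event $i \in S_t$ happens with probability m/n. If $i \in S_t$, then $y_i^{t+1}$ is updated
to the value $\tilde{y}_i$. Otherwise, $y_i^{t+1}$ keeps its old value $y_i^t$. Let $\xi_t$ be the
random event that contains the set of all random variables before round t,

$$\xi_t = \{S_1, S_2, \dots, S_t\}, \qquad (39)$$

and then we have

$$E_{\xi_t}\left[ \|y_i^{t+1} - y_i^*\|_2^2 \right] = \frac{m}{n}\|\tilde{y}_i - y_i^*\|_2^2 + \frac{n - m}{n}\|y_i^t - y_i^*\|_2^2,$$

$$E_{\xi_t}\left[ \|y_i^{t+1} - y_i^t\|_2^2 \right] = \frac{m}{n}\|\tilde{y}_i - y_i^t\|_2^2,$$

$$E_{\xi_t}\left[ y_i^{t+1} \right] = \frac{m}{n}\tilde{y}_i + \frac{n - m}{n} y_i^t.$$

Consequently, we can insert the representations of $\|\tilde{y}_i - y_i^*\|_2^2$, $\|\tilde{y}_i - y_i^t\|_2^2$ and
$\tilde{y}_i$ in terms of the above expectations into the inequality (38).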



n — m \ 

-7 

m J 



We then sum the above inequality over i = 1, 2, ..., n and divide both sides by
n to obtain


l|y‘-y*llM>% 


yX^ 


Lily 


i+1 




X* - x‘, 


1 


V 

m 

3^S. 


+ (yr^-y‘) 


(41) 


where Hi = 


-7, Mi = 


= ^Er=iA*y*, and u* = 


- X)r=i ■^^y^ la the crossing term between primal and dual variable, we use the 
fact that Yh=i ^^{yX - yI) = Ejes* Aj(y‘+^ - y‘) since only the blocks in 
index set St are chosen and updated in t-th update. 
Now we characterize the t-th update of the primal variable x. Following the same
derivation as for the dual variable and using the assumption that $g(\cdot)$ is $\lambda$-strongly
convex, we can easily obtain

^ll-‘ - -111 S (i- + ■') 

+ (x‘« - x<.u< - u* + 1 ^ A, (y<« -y‘)). (42) 

\ jeSt / 


Taking the expectation of both sides of the above inequality and adding it to
the inequality (41), we have



+ % 


-x*||^+||y‘-y*||^> + [||x‘+1-x’^||^]+E^, [||y‘+i 

+ - ^12] + [lly‘+' - y‘ll^] 

/x*+i _ _ 0* (x‘ - x‘-i) , A f i(y‘ - y’^) + l(y‘+i - y*)) 
\ \n m ) 



(43) 


where the matrix $A = [A_1, A_2, \dots, A_n] \in \mathbb{R}^{d \times q}$.

Now we focus on the most crucial part of the proof: bounding the last term
on the R.H.S. of the above inequality. First, we rearrange this cross term
as follows,


- x‘ - (x‘ - x‘-i) , A Q(y‘ - y*) + l(y‘+i - y‘)) ^ 

= i (x‘+i - X*, A (y‘+i - y*)) - ^ (x‘ - x‘-\ A (y‘ - y*)) 
n n 

+ (x‘+i - x‘, A (y*+i - y‘)) - - (x* - x‘-i, A (y*+i - y*)) . (44) 

ran m 

Given the parameter configuration in Eq. (10) and (12), we consider the following
symmetric matrix,

$$P = \begin{bmatrix} \frac{m}{2\tau^t} I & -A_{S_t} \\ -A_{S_t}^T & \frac{1}{2}\,\mathrm{diag}(\sigma_{S_t})^{-1} \end{bmatrix}. \qquad (45)$$

Applying Lemma 1, we can guarantee the positive definiteness of the matrix
P, which naturally leads to the following inequality.


x‘+l 


4t* 


— X 


E 

ieSt 


5^1 


yr -y:ill> 




— X 


.E 

ieSt 


A. (y‘+^ - y‘) 


(46) 
Similarly, we can also obtain

^||X-+' - ^ A||y.« -y‘g>- (x-t. - X-, A. {y‘« - y')' 


ieSt 


ieSt 


(47) 

Taking the expectation for both sides of the above two equalities and using the 
facts that 




— ||x‘+l-X*||2 + ||y‘+l-y‘f 

L4T 4diag(o-) J 




ieSt 


\\\l 


Ej, [|(x‘+i-x*,A(y‘+i-y*))|] 

= E5, 


x‘+i-x‘,^A.(y‘+i-y‘) 

ieSt / 


we have that 

Ej, [|(x‘+i-x‘,A(y‘+i-y 

Similarly, we can obtain 
E,J|(x‘-x‘-i,A(y*+i-y^ 

Therefore, 


<E6 


<E5, 


m 


fy||x‘+i-x‘||i + ||y‘+i-y‘f_. 

L4r 4dlag(<7-) J 

(48) 


.4r^ 


X — X 


+ l|y‘+'-y‘f_i 

4diag(cr) 

(49) 


E^J(x‘+i-x‘,A(y‘+i-y‘))] >-E5, —1| 

-Ee, 

Ej, [(x‘-x‘-i,A(y‘+i-y‘))] >-E5, 

-Es, 


,„x‘+i-x*l|2 
-4t‘ 

||y*+l-y*f 

4diag(cr) 

m . 

Ar^ 




||y‘+l-y*||2 

4diag{cr) 


(50) 


(51) 


Now we insert the Eq. (l4^ into the inequality (143 1) . and then apply the two 
bounds o and (EH), we have 


A||x'-x*||^+||y'-y*||J> (A + a)E; 


-•«-x*|ia+Ef,[||y‘+‘-y*||J,] 


1 


[II 


x‘+i-x‘||21 


1 


+ [(x*+i-x‘,A(x‘+i-x‘))] 


n 


-|rl|x'-x'-‘||^^ {x- - x-‘, A (y' - y.))+ l [||y'+' - y'llJ] 
Recalling the configuration of $\theta^t$ in Eq. (15), the last term on the R.H.S. of the above
inequality is non-negative and can be dropped. We then have the following,

^||x‘-x*||i + ||y‘-y*||2+^||x*+x*-i||2 + ^(x‘-x‘-\A(y*-y*))> 


2t* 

(2^ + >) - -nil + [Ily‘*‘ - y'Wl] + [II-‘” - -'llil 


+ -Ef, [(x*+i - x‘, A (x‘+i - X*))] ^ (52) 


n 


According to the defined sequence we have 

=0‘ (^ + a) ||x* - x*||i + 0‘||y‘ - y*||2, 


-I 

4t‘ 


|x‘ + x‘-i||2 + -(x*-x‘-\A(y‘-y*)) 

n \ \ / / 


(53) 


According to the parameter configuration for r*, <Ji and 0*, we can easily verify 
that 




1 

2t* 


+ A > 


1 

2r‘ 


6»Vi > 


Combining these two inequalities with the inequality (15^ and Eq. (15^ . we have 

Consider t = 0,1,..., T, the above inequality implies 


\2tT 


-bA E ||x^-x 




■E[||y^'-y*||^£j+^E [||x--x 

/ T 




where 


+ iE[(x^-x^-\A(y^-y*))] < (^5) 

+ a) E [||x° - x*||2] + E [||y0 - y*||2,] . (56) 


Consider the following matrix 

Q = 


' 2^1 ±As, 

-L aT n 

St 2mdiag((T5^) 


(57) 


Applying the Lemma 1 again, we can guarantee the positive definiteness of the 
matrix, which implies that 







Z. Zhu and A.J. Storkey 


Taking expectation 
1 


1 


E[|(x^-x^-\A(y^-y*))|] <;f^E|||x^ -x 


+ [l|y ~y lll/diag(<T)] 


Thus, 


,T- 1||21 


iE[(x^-x^ \A(y^-y*))] >-^Elii- -- ll2 

1 


X — X 

~ 4^®" [l|y l|l/diag((T)] 


Then combining the above inequality with inequality (1551) . we have 


(^^ + AjE[||x^-x*||^]+E[||y^-y*||^] 

^ fn ll^° - ^’^112 + lly° - y*\\l) > 


where and ||y^-y1|^ = Er=i - 

which completes the proof. 




